Tera-Scale 1D FFT with Low-Communication Algorithm and Intel®

Authors

  • Jongsoo Park
  • Ganesh Bikshandi
  • Karthikeyan Vaidyanathan
  • Ping Tak Peter Tang
  • Pradeep Dubey
  • Daehyun Kim
Abstract

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the teraflop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× what is achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of a low-communication algorithm and a massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.

Similar resources

Development of an FPGA-Based Two-Transform Pulse Compressor Mr. Skip

Recent advances in Field Programmable Gate Array (FPGA) technologies have resulted in high gate count and high performance FPGA parts which offer a cost-effective and short development cycle solution for computation intensive signal processor applications. These parts provide an attractive middle ground between Commercial Off-the-Shelf (COTS) boards employing Digital Signal Processor (DSP) chip...

Parallel Implementations of the Split-Step Fourier Method for Solving Nonlinear Schrödinger Systems

We present a parallel version of the well-known Split-Step Fourier method (SSF) for solving the Nonlinear Schrödinger equation, a mathematical model describing wave packet propagation in fiber optic lines. The algorithm is implemented under both distributed and shared memory programming paradigms on the Silicon Graphics/Cray Research Origin 200. The 1D Fast-Fourier Transform (FFT) is paralleliz...

Parallel Implementation of Multidimensional Transforms without Interprocessor Communication

This paper presents a modular algorithm which is suitable for computing a large class of multidimensional transforms in a general purpose parallel environment without interprocessor communication. Since it is based on matrix-vector multiplication, it does not impose restrictions on the size of the input data as many existing algorithms do. The method is fully general since it does not depend o...

Using WPT as a New Method Instead of FFT for Improving the Performance of OFDM Modulation

Orthogonal frequency division multiplexing (OFDM) is used to provide immunity against very hostile multipath channels in many modern communication systems. The OFDM technique divides the total available frequency bandwidth into several narrow bands. In conventional OFDM, the FFT algorithm is used to provide orthogonal subcarriers. Intersymbol interference (ISI) and intercarrier interferen...

Publication date: 2013